We are now entering the phase of doing serious statistics: we run regression models, inspect them, plot, and tabulate them.
For this exercise, let’s pretend to be researchers interested in the determinants of the subjective risk of being treated for COVID-19 in a hospital. We suspect that age plays an important role, something we have already assessed by visualizing a similar relationship. So… we need the GESIS Panel COVID-19 survey data again:
library(dplyr)
library(haven)
library(sjlabelled)
gp_covid <-
read_sav(
"../data/ZA5667_v1-1-0.sav"
) %>%
set_na(na = c(-1:-99, 97)) %>%
mutate(
likelihood_hospital = hzcy003a,
age_cat = as.factor(age_cat)
) %>%
remove_all_labels()
We start our analysis by checking the relationship with an ANOVA. We also include some very essential covariates: gender, political orientation, and marital status.
+.
Alright, there’s definitely something in the data we should investigate. Oftentimes this is considered bad practice, but we could try to create a median split for the dependent variable and run a logistic regression.
likelihood_hospital and run a logistic regression with all variables. What do you see in the results?
dplyr::mutate() in combination with the ifelse() function.
Running one model is not enough. It may be that the assumptions for logistic regression are not fulfilled, or a reviewer simply doesn’t like these types of regressions. Instead, she proposes to run a binomial regression but with a cauchit link.
?family to see how you can include a cauchit link.
Using different link functions is sometimes done as they provide different model fits. That’s definitely something we should investigate as well.
test = "LRT" to perform a likelihood ratio test. What’s your interpretation?